Mortality models in the NFI plots

Author

Juliette Archambeau

Published

January 16, 2024

Code
data <- readRDS(here("data/NFIdata/NFIdata_cleaned.rds"))

1 Introduction

Goal: estimate the association between the genomic offset and the mortality rates in natural populations.

1.1 Causal model

flowchart LR
  C[Competition among trees] --> M[Mortality rates]
  DBH[Average tree age] --> M
  DBH <--> C
  GO[Genomic offset] --> M
  Country --> M
  CS[Census interval] --> M
  S[Spatial autocorrelation] --> C
  S --> GO
  S --> M
  S --> DBH
  style S fill:#D5F5E3,stroke:#2ECC71
  style M fill:#F5B7B1,stroke:#E74C3C

The mortality rates in the NFI plots may be influenced by:

  • Competition among trees. To account for it, we use as a proxy the variable \(C_{i}\), the basal area of all trees (all species combined) in the plot \(i\).

  • Average tree age. To account for it, we use as a proxy the variable \(DBH_{i}\), the mean diameter at breast height (DBH) of all maritime pines in the plot \(i\), including adults, dead trees and seedlings/saplings (i.e. recruitment).

  • Predicted genomic offset at the location of the plot \(i\), a proxy of the potential maladaptation experienced by the trees at this location.

  • Country. Indeed, the French and Spanish inventories present noticeable methodological differences that may bias the estimates, so we may want to estimate country-specific coefficients: the country-specific intercepts \(\beta_{0,c}\) and the country-specific slopes \(\beta_{C,c}\), \(\beta_{GO,c}\) and \(\beta_{BA,c}\). This is what I did during my PhD.

  • Census interval, which we account for with an offset and a complementary log-log link function (see justification below).

  • Spatial autocorrelation among plots. To account for it, we may use Gaussian processes, following section 14.5 of Statistical Rethinking by Richard McElreath.

During my PhD, I also included the potential influence of fires on mortality rates. For that, I used as proxy the variable \(BA_{i}\), which is the estimated monthly burned area from the GFED4 database at the location of the plot \(i\). Note that mortality events due to fires should not be recorded in the NFI. However, high mortality rates are observed in Galicia (see exploratory analyses in report 8_ValidationNFI) and one potential explanation may be the higher fire-related mortality in this region. In the following analyses though, I decided not to incorporate fires as a potential cause of mortality rates, but this is something we can discuss!

1.2 Mathematical model

In order to account for the different census intervals between inventories, we model the proportion \(p_{i}\) of maritime pines that died in the plot \(i\) during the census interval, using the complementary log-log link and an offset on the logarithm of the census interval \(\Delta_{i}\), as follows:

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= X\beta + \text{log}(\Delta_i) \\ \end{align*}\]

with \(N_{i}\) the total number of maritime pines in the plot \(i\), \(m_{i}\) the number of maritime pines that died during the census interval \(\Delta_{i}\) in the plot \(i\), \(X\) the matrix of the explanatory variables (with a column of 1 for the intercept), and \(\beta\) the vector of the associated coefficients (including the intercept).
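To see why this offset accounts for the census interval, note that the model implies a constant annual hazard (a short derivation using the notation above):

\[\begin{align*} 1 - p_i &= \exp\!\left(-e^{X\beta + \text{log}(\Delta_i)}\right) = \exp\!\left(-e^{X\beta}\,\Delta_i\right) = \left[\exp\!\left(-e^{X\beta}\right)\right]^{\Delta_i} \end{align*}\]

so the probability of surviving the census interval is the per-year survival probability raised to the power \(\Delta_i\): plots with longer census intervals are expected to show proportionally more mortality, with \(e^{X\beta}\) acting as a constant annual mortality hazard.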

1.3 Mathematical models compared

We check whether the models are able to recover the simulated parameter values (simulated under the assumption that the model is true). To do this, we start with a very simple model and gradually make it more complex.

1.3.2 Model M1: adding the genomic offset

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \text{log}(\Delta_i) \\ \end{align*}\]

with \(\Delta_{i}\) the census interval for the plot \(i\). In this model, \(m_{i}\) is the number of maritime pines that died during the census interval \(\Delta_{i}\).
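As a quick numerical sanity check of the offset formulation (a standalone sketch with made-up parameter values, not fitted estimates), we can verify in R that doubling the census interval leaves the implied per-year survival probability unchanged:

```r
# Inverse complementary log-log link, as used in the Stan models
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

beta_0 <- -6; beta_GO <- 0.5; GO <- 1 # fictive parameter values

p_10y <- inv_cloglog(beta_0 + beta_GO * GO + log(10)) # 10-year census interval
p_20y <- inv_cloglog(beta_0 + beta_GO * GO + log(20)) # 20-year census interval

# More mortality is expected over a longer interval, but the
# per-year survival probability is the same in both cases:
all.equal((1 - p_10y)^(1/10), (1 - p_20y)^(1/20)) # TRUE
```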

1.3.3 Model M2: adding a second predictor

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \beta_C C_i + \text{log}(\Delta_i) \\ \end{align*}\]

with \(C_{i}\) the basal area of all trees (all species combined) in the plot \(i\) (to account for the competition between trees) and its associated slope \(\beta_C\).

1.3.4 Model M3: adding country-specific fixed effects

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0,c} + \beta_{C} C_i + \beta_{GO} GO_i + \text{log}(\Delta_i) \\ \end{align*}\]

with \(\beta_{0,c}\) the country-specific intercept, \(C_{i}\) the basal area of all trees (all species combined) in the plot \(i\) (to account for the competition between trees) and its associated slope \(\beta_{C}\), and \(GO_{i}\) the estimated genomic offset in the plot \(i\) and its associated slope \(\beta_{GO}\).

1.3.5 Model M4: adding country-specific varying effects

We might want to use country-specific varying intercepts (instead of fixed intercepts), but I thought it would add unnecessary complexity to the model (and there are only two countries from which to estimate the variance among country-specific intercepts). In the end, I did not include this model in the model comparison.

Below is the non-centered version of the model (using \(z\)-scores), which facilitates convergence.

\[\begin{align*} m_{i} &\sim \text{Binomial}(N_{i},p_{i})\\ \text{log}(-\text{log}(1-p_{i})) &= z_{\beta_{0,c}} \sigma_{\beta_{0}} + z_{\beta_{C,c}} \sigma_{\beta_{C}} C_{i} + z_{\beta_{GO,c}} \sigma_{\beta_{GO}} GO_{i} + \text{log}(\Delta_{i}) \\[4pt] \begin{bmatrix} z_{\beta_{0,c}} \\ z_{\beta_{C,c}} \\ z_{\beta_{GO,c}} \end{bmatrix} & \sim \text{MVNormal}\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},\mathbf{R} \right) \\[4pt] \begin{bmatrix} \sigma_{\beta_{0}} \\ \sigma_{\beta_{C}}\\ \sigma_{\beta_{GO}} \end{bmatrix} & \sim \text{Exponential}(1) \\[4pt] \mathbf{R}& \sim \text{LKJcorr(4)} \end{align*}\]
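Note that in this non-centered formulation, the country-specific coefficients on the natural scale are simply the products of the \(z\)-scores and the corresponding standard deviations (a change of variables, not extra parameters):

\[\beta_{0,c} = z_{\beta_{0,c}}\,\sigma_{\beta_{0}}, \qquad \beta_{C,c} = z_{\beta_{C,c}}\,\sigma_{\beta_{C}}, \qquad \beta_{GO,c} = z_{\beta_{GO,c}}\,\sigma_{\beta_{GO}}\]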

1.3.6 Model M5: adding spatial autocorrelation

We will use spatial Gaussian processes with a squared exponential covariance kernel to model the spatial processes.

Materials used to build this model:

1.3.6.1 M5_1: Spatial autocorrelation alone

We first consider a model with only an intercept and the spatial term (with the complementary log-log link), to determine whether we are able to accurately estimate the spatial autocorrelation parameters.

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \mu_i\\ \end{align*}\]

The plot-specific intercepts \(\mu_i\) capture the spatial effects. They have a multivariate Gaussian prior such that:

\[\begin{align*} \mu_i & \sim \text{MVNormal}(0,\mathbf{K})\\ K_{ij} &= \alpha^{2} \exp \left( - \frac{ D_{ij}^2 }{2\rho^2} \right) + \delta_{ij} \sigma^2 \end{align*}\]

\(\mathbf{K}\) is the covariance matrix of the \(\mu_i\) intercepts and belongs to the squared exponential family (the exponentiated quadratic covariance function in Stan’s manual); we can also speak of a squared exponential covariance kernel. This function says that the covariance between any two spatial points \(i\) and \(j\) declines exponentially with the squared distance between them.

The parameter \(\rho\) determines the rate of decline, i.e. it controls how quickly the correlations between spatial points decay as a function of the distance between them. With this parameterization, the larger \(\rho\), the more slowly the covariance declines with squared distance. \(\rho\) is often called the length-scale parameter.

\(\alpha^{2}\) is the maximum covariance between any two spatial points \(i\) and \(j\) (Statistical Rethinking, McElreath). The parameter \(\alpha\) controls the marginal variability of the spatial function at all points; in other words, it controls how much the spatial term contributes to the linear predictor (Stan Geostatistical Model by Nicholas Clark).

\(\delta_{ij} \sigma^2\) provides for extra covariance beyond \(\alpha^{2}\) when \(i = j\). It does this because the function \(\delta_{ij}\) is equal to 1 when \(i = j\) but is zero otherwise (Statistical Rethinking, McElreath). In our study, this term does not matter because we only have one observation for each NFI plot. But if we had more than one observation per plot, \(\sigma\) would describe how these observations covary.

So, in our study, when \(i=j\), \(K_{ij} = \alpha^{2}\).
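This kernel can be sketched in a few lines of R (the coordinates, \(\alpha\) and \(\rho\) below are made-up illustrative values, and the \(\delta_{ij}\sigma^{2}\) term is omitted since we have one observation per plot):

```r
alpha <- 1    # square root of the maximum covariance
rho   <- 500  # length scale (in km)

coords <- cbind(x = c(0, 100, 800), y = c(0, 0, 0)) # three fictive plot locations
D <- as.matrix(dist(coords))                        # pairwise distances (km)

K <- alpha^2 * exp(-D^2 / (2 * rho^2)) # squared exponential covariance matrix

# On the diagonal (i = j), K_ij = alpha^2; off the diagonal, the
# covariance declines with the squared distance between plots
round(K, 3)
```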

Priors

In Statistical Rethinking, McElreath defines priors for the squares of \(\rho\), \(\alpha\) and \(\sigma\) and estimates them on that scale, because it is computationally easier. As in our study, he does not need \(\sigma\) in his model, so he fixes it at an irrelevant constant. As \(\rho^{2}\) and \(\alpha^{2}\) must be positive, he uses exponential priors for them: \(\rho^{2} \sim \text{Exponential(0.5)}\) and \(\alpha^{2} \sim \text{Exponential(2)}\).

In Stan Geostatistical Model, Nicholas Clark uses \(\rho \sim \text{Normal(2,2.5)}\) (very small values would not be identifiable, so he describes this prior as informative) and \(\alpha \sim \text{Normal(0,1)}\).

In our study, the distances (in km) among plots are very large and so we use:

\[\begin{align*} \rho & \sim \text{Normal}(500,200)\\ \alpha & \sim \text{Normal}(0,1) \end{align*}\]

We check what the prior distributions imply for the covariance functions.

Code
plot(NULL, xlab="Distance (in km)", ylab="Covariance", xlim=c(0,2150), ylim=c(0,6))

n_curve <- 50
rho <- rnorm(n_curve, 500, 200) # draws from the prior of rho
alpha <- rnorm(n_curve, 0, 1)   # draws from the prior of alpha

for ( i in 1:n_curve ) curve(alpha[i]^2 * exp(-0.5 * ((x / rho[i])^2)), add=TRUE, col=col.alpha("darkgreen",0.5))

Problem: this model already takes a long time to fit on 200 simulated spatial points, so it would not be possible to fit it on the 11917 NFI plots… So, I decided not to include spatial autocorrelation in the models. However, I checked that the model M3 was able to accurately estimate the association between the mortality rates and the genomic offset predictions on simulated data with spatial autocorrelation. The rationale was that, if the model M3 shows reasonable accuracy on simulated data with autocorrelation, we could use it to estimate the association between the mortality rates and the genomic offset predictions even though it does not account for spatial autocorrelation.

1.3.6.2 M5_2: Merging with the other variables

If it had been possible to fit the M5_1 model to the NFI plots, the next step would have been to incorporate the genomic offset predictions and the basal area into the model, as follows: \[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0,c} + \beta_{C,c} C_i + \beta_{GO,c} GO_i + \text{log}(\Delta_i) + \mu_i\\ \end{align*}\]

However, as I did not manage to fit the M5_2 model to the full dataset, I have not evaluated this model in this report.

1.3.7 Model M6: adding a third predictor

We add to model M3 a third predictor.

1.3.7.1 M6_1: without interaction

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \beta_C C_i + \beta_{DBH} DBH_i + \text{log}(\Delta_i) \\ \end{align*}\]

with \(DBH_{i}\) the mean diameter at breast height (DBH) of the maritime pines in the plot \(i\) (including adults, dead trees and recruitment trees) and its associated slope \(\beta_{DBH}\).

1.3.7.2 M6_2: with interaction

M6_2 models the interaction between \(C_i\) and \(DBH_{i}\).

\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \beta_C C_i + \beta_{DBH} DBH_i + \beta_{C*DBH} C_i DBH_i + \text{log}(\Delta_i) \\ \end{align*}\]

2 Simulated data

To determine whether the models are able to recover the true parameters (i.e. the simulated parameters), we perform 100 simulations:

Code
nb_simulations <- 100
Code
# Some functions to show model coefficients and coverage across simulations

make_coeff_table <- function(mod,pars=NULL,true_params){

  conf95 <- broom.mixed::tidyMCMC(mod,
                                  pars=pars,
                                  droppars = NULL, 
                                  robust = FALSE, # give mean and standard deviation
                                  ess = T, # effective sample size estimates
                                  rhat = T, # Rhat estimates
                                  conf.int = T, conf.level = 0.95 # include 95% credible intervals
                                  ) %>% 
    dplyr::rename(conf95_low=conf.low,
                  conf95_high=conf.high,
                  mean=estimate,
                  std_deviation=std.error)
  
  
  broom.mixed::tidyMCMC(mod,
                        pars=pars,
                        droppars = NULL, 
                        robust = TRUE, # give median and mean absolute deviation (= avg distance btw each data point and the mean)
                        ess = F, rhat = F, 
                        conf.int = T, conf.level = 0.80 # include 80% credible intervals
                        ) %>% 
    dplyr::rename(conf80_low=conf.low,
                  conf80_high=conf.high,
                  median=estimate,
                  mean_absolute_deviation=std.error) %>% 
     inner_join(conf95,by=c("term")) %>% 
     dplyr::select(term,
                   mean, mean_absolute_deviation,
                   median, std_deviation,
                   conf80_low,conf80_high,
                   conf95_low,conf95_high, 
                   rhat,ess) %>% 
  mutate(true_values = true_params) %>% 
  mutate(conf80_coverage = if_else(conf80_low<=true_values & conf80_high>=true_values, TRUE,FALSE),
         conf95_coverage = if_else(conf95_low<=true_values & conf95_high>=true_values, TRUE,FALSE))
  
}
   
calc_coverage_across_sims <- function(x){

  x %>% 
  group_by(term) %>% 
  group_split() %>% 
  purrr::map(\(x){
    data.frame(parameter=unique(x$term),
               conf80_coverage=sum(x$conf80_coverage == TRUE),
               conf95_coverage=sum(x$conf95_coverage == TRUE))
    }) %>% 
    bind_rows()}

2.2 M1: adding the genomic offset

Code
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m1.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
  int N;
  vector[N] log_nb_years;                                                           // Offset to account for different census intervals
  vector[N] GO;                                                                     // Genomic offset
  int nb_dead[N];                                                                   // Number of dead trees in the plot
  int nb_tot[N];                                                                    // Total number of trees in the plot
}
parameters {
  real beta_0;
  real beta_GO;
}
model {
  nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + beta_GO * GO + log_nb_years));

  beta_0 ~ normal(0, 1);
  beta_GO ~ normal(0, 1);
} 
Code
set.seed(494442)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot


# Data simulation
lapply(nb_setseed,function(seed){
 
set.seed(seed) 

nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
beta_0 <- runif(1,-6.5,-5.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
eta <- beta_0 + beta_GO * GO + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 log_nb_years=log(nb_years))


# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0,save_warmup = FALSE) 


# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(beta_0,beta_GO))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m1.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m1.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
beta_0 76 94
beta_GO 82 94

2.3 M2: adding a second predictor

Code
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m2.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
  int N;
  vector[N] C;                                                                    // Proxy of the competition among trees (i.e. basal area)
  vector[N] log_nb_years;                                                           // Offset to account for different census intervals
  vector[N] GO;                                                                     // Genomic offset
  int nb_dead[N];                                                                   // Number of dead trees in the plot
  int nb_tot[N];                                                                    // Total number of trees in the plot
}
parameters {
  real beta_0;
  real beta_GO;
  real beta_C;
}
model {
  nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + beta_GO * GO + beta_C * C + log_nb_years));

  beta_0 ~ normal(0, 1);
  beta_GO ~ normal(0, 1);
  beta_C ~ normal(0, 1);
} 
Code
# Parameters to estimate
params_to_estimate <- c("beta_0","beta_GO","beta_C")

2.3.1 Balanced design

Code
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot

# Data simulation
lapply(nb_setseed,function(seed){
  
set.seed(seed) 
  
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
beta_0 <- runif(1,-7.5,-6.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area (capturing the competition among trees)
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- beta_0 + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 log_nb_years=log(nb_years))



# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0,save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(beta_0,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_balanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_balanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
beta_0 76 90
beta_C 76 91
beta_GO 82 96

2.3.2 Unbalanced design

Code
set.seed(420453)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots

# Data simulation
lapply(nb_setseed,function(seed){

set.seed(seed) 
  
nb_tot <-sample(5:30, N, replace=T)  # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
beta_0 <- runif(1,-7.5,-6.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- beta_0 + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0,save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(beta_0,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_unbalanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_unbalanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
beta_0 71 86
beta_C 75 91
beta_GO 82 94

2.3.3 True design

Code
set.seed(4453)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot  # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates


# Data simulation
lapply(nb_setseed,function(seed){
  
set.seed(seed)

beta_0 <- runif(1,-7.5,-6.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area (capturing the competition among trees)
C <- (C - mean(C)) / sd(C) # scaling the basal area

eta <- beta_0 + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(beta_0,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_truedesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_truedesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
beta_0 74 94
beta_C 80 94
beta_GO 78 96

2.4 M3: adding country-specific fixed effects

Code
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m3.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
  int N;
  vector[N] log_nb_years;                                                           // Offset to account for different census intervals
  vector[N] C;                                                                      // Proxy of the competition among trees (i.e. basal area)
  vector[N] GO;                                                                     // Genomic offset
  int nb_dead[N];                                                                   // Number of dead trees in the plot
  int nb_tot[N];                                                                    // Total number of trees in the plot
  int<lower=0> nb_country;                                                          // Number of countries
  int<lower=0, upper=nb_country> country[N];                                        // Countries
}
parameters {
  vector[nb_country] alpha_country;
  real beta_GO;
  real beta_C;
}
model {
  nb_dead ~ binomial(nb_tot,inv_cloglog(alpha_country[country] + beta_GO * GO + beta_C * C  + log_nb_years));

  alpha_country ~ normal(0, 1);
  beta_GO ~ normal(0, 1);
  beta_C ~ normal(0, 1);
} 
Code
# Parameters to estimate
params_to_estimate <- c("alpha_country","beta_GO","beta_C")

2.4.1 Balanced design

Code
set.seed(4945424)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
nb_country <- 2
country <- c(rep(1,N/2),rep(2,N/2))

# Data simulation
lapply(nb_setseed,function(seed){
  
set.seed(seed)

nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))


# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars=params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_balanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_balanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 66 86
alpha_country[2] 76 90
beta_C 79 92
beta_GO 83 96

2.4.2 Unbalanced design

Code
set.seed(53)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_country <- 2 # nb of countries
country <- c(rep(1,N/2),rep(2,N/2))


# Data simulation
lapply(nb_setseed,function(seed){

set.seed(seed) 

nb_tot <-sample(5:30, N, replace=T)  # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)

alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area

GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars=params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_unbalanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_unbalanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 64 88
alpha_country[2] 67 86
beta_C 71 88
beta_GO 79 94

2.4.3 True design

Code
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot  # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates


# Data simulation
lapply(nb_setseed,function(seed){
  
set.seed(seed)

nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts

beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area

eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_truedesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_truedesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 80 93
alpha_country[2] 74 92
beta_C 83 97
beta_GO 82 99

2.4.4 True data with random GO

In this set of simulations, we use the NFI data but replace the genomic offset predictions at the location of the NFI plots by a randomly generated variable such that \(x \sim \mathcal{N}(0,1)\).

Code
run_randomGO <- function(stancode,data){

random_GO <- rnorm(nrow(data),mean=0,sd=1) 

datalist <- list(N=nrow(data),
                 nb_dead=data$nb_dead,
                 nb_tot=data$nb_tot,
                 GO=(random_GO - mean(random_GO)) / sd(random_GO),
                 C=(data$basal_area-mean(data$basal_area))/sd(data$basal_area),
                 nb_country=length(unique(data$country)),
                 country=as.numeric(data$country),
                 log_nb_years=log(data$nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Model coefficients
conf95 <- broom.mixed::tidyMCMC(mod,
                                pars=params_to_estimate,
                                droppars = NULL, 
                                robust = FALSE, # give mean and standard deviation
                                ess = T, # effective sample size estimates
                                rhat = T, # Rhat estimates
                                conf.int = T, conf.level = 0.95 # include 95% credible intervals
                                ) %>% 
    dplyr::rename(conf95_low=conf.low,
                  conf95_high=conf.high,
                  mean=estimate,
                  std_deviation=std.error)
  
  
  broom.mixed::tidyMCMC(mod,
                        pars=params_to_estimate,
                        droppars = NULL, 
                        robust = TRUE, # give median and mean absolute deviation (average absolute distance from the median)
                        ess = F, rhat = F, 
                        conf.int = T, conf.level = 0.80 # include 80% credible intervals
                        ) %>% 
    dplyr::rename(conf80_low=conf.low,
                  conf80_high=conf.high,
                  median=estimate,
                  mean_absolute_deviation=std.error) %>% 
     inner_join(conf95,by=c("term")) %>% 
     dplyr::select(term,
                   mean, mean_absolute_deviation,
                   median, std_deviation,
                   conf80_low,conf80_high,
                   conf95_low,conf95_high, 
                   rhat,ess)
   

}

set.seed(495)

lapply(1:nb_simulations, function(x) run_randomGO(stancode,data)) %>% 
  saveRDS(here(paste0("outputs/ValidationNFI/Simulations/m3_truedata_randomGO_",nb_simulations,"simulations.rds")))
Code
# Function to count the number of simulations in which the 95% and 80% credible intervals overlap with zero. 
count_overlapp_with_zero_across_sims <- function(x){

  x %>% 
  group_by(term) %>% 
  group_split() %>% 
  purrr::map(\(x){
    data.frame(parameter=unique(x$term),
               conf80_include_zero=sum(x$conf80_include_zero == TRUE),
               conf95_include_zero=sum(x$conf95_include_zero == TRUE))
    }) %>% 
    bind_rows()}

# Extract the outputs and identify the simulations in which the 95% and 80% credible intervals overlap with zero. 
df <- readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/m3_truedata_randomGO_",nb_simulations,"simulations.rds"))) %>% 
  bind_rows(.id="sim_ID") %>% 
  dplyr::filter(term == "beta_GO") %>%
  mutate(conf80_include_zero = if_else(conf80_low<=0 & conf80_high>=0, TRUE,FALSE),
         conf95_include_zero = if_else(conf95_low<=0 & conf95_high>=0, TRUE,FALSE))

# How many times do the 95% and 80% credible intervals include zero?
df %>% 
  count_overlapp_with_zero_across_sims %>% 
  kable_mydf()
parameter conf80_include_zero conf95_include_zero
beta_GO 36 44
Code
# Show the mean and 95% / 80% credible intervals with interval plots
df %>% 
  ggplot(aes(mean, reorder(sim_ID, mean))) +
  geom_vline(xintercept = 0, color="gray30") +
  geom_errorbar(aes(xmin = conf95_low, xmax = conf95_high), color="#009E73") +
  geom_errorbar(aes(xmin = conf80_low, xmax = conf80_high), color="orange") +
  geom_point() +
  theme_bw() +
  labs(x="Effect size of a randomly generated variable", y="ID of the simulations") +
  theme(axis.text.y = element_text(size=6))

2.5 M5: adding spatial autocorrelation

2.5.1 Balanced design

The code below is largely based on Nicholas Clark's tutorial Stan Geostatistical Model.

We first generate some fake spatial points within the spatial range of the NFI plots.

Code
# Generating coordinates of spatial points based on the geographical range of the NFI plots
# =========================================================================================

N <- 200 # nb of plots
lat <- runif(n = N, min = min(data$latitude), max = max(data$latitude))
lon <- runif(n = N, min = min(data$longitude), max = max(data$longitude))
coords <- as.matrix(cbind(lon, lat))
colnames(coords) <- c('Longitude', 'Latitude')

We simulate 200 coordinate pairs within the geographic extent of the NFI plots: latitudes and longitudes are drawn uniformly between the minimum and maximum values observed in the data.

We then calculate the distances (in meters) between the points, make the resulting matrix symmetric, and convert the distances to kilometers.

Code
m <- pointDistance(coords, lonlat = TRUE) # calculate distances (in meters) between the points
m[upper.tri(m)] = t(m)[upper.tri(m)] #  ensure the resulting matrix is symmetric
m <- m / 1000 # convert distances to kilometers
m[1:10,1:10] # preview the matrix
           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]
 [1,]    0.0000 1187.7291 1066.1934 1335.6785  532.3548 1387.7242  951.8294  957.3336 1249.5280 1073.8229
 [2,] 1187.7291    0.0000  423.3596  157.5926 1336.2816 1129.5277 1319.3939  986.3943 1041.3719  360.9065
 [3,] 1066.1934  423.3596    0.0000  550.7623 1378.6763 1468.1351 1503.1758 1225.1371 1358.6983  700.8530
 [4,] 1335.6785  157.5926  550.7623    0.0000 1452.0022 1126.0241 1391.1026 1043.6973 1054.5010  407.1606
 [5,]  532.3548 1336.2816 1378.6763 1452.0022    0.0000 1058.7853  479.3696  652.2654  937.1620 1082.7107
 [6,] 1387.7242 1129.5277 1468.1351 1126.0241 1058.7853    0.0000  643.4014  432.3474  139.4959  777.6579
 [7,]  951.8294 1319.3939 1503.1758 1391.1026  479.3696  643.4014    0.0000  360.3180  551.8977  984.2593
 [8,]  957.3336  986.3943 1225.1371 1043.6973  652.2654  432.3474  360.3180    0.0000  297.2924  637.2568
 [9,] 1249.5280 1041.3719 1358.6983 1054.5010  937.1620  139.4959  551.8977  297.2924    0.0000  682.4744
[10,] 1073.8229  360.9065  700.8530  407.1606 1082.7107  777.6579  984.2593  637.2568  682.4744    0.0000
Code
# Another way of doing it, which
# gives the same distance matrix:
# library(geodist)
# t <-  coords %>% geodist(measure="geodesic")
# t[1:10,1:10]

In this matrix, the maximum distance is 2045 km.

We then assign values to the parameters \(\rho\) and \(\alpha\) so that the simulated data show obvious spatial autocorrelation.

Code
alpha <- 1 # maximum covariance (alpha^2 at distance 0)
rho <- 500 # length-scale: covariance decays over ~500 km 

We calculate the squared exponential covariance matrix. A small offset is added to the diagonal to ensure that the resulting matrix is positive definite, a requirement for simulating multivariate normal draws.

Code
K <- alpha^2 * exp(-0.5 * ((m / rho) ^ 2)) + diag(1e-9, N)
K[1:10,1:10]
            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]       [,8]       [,9]      [,10]
 [1,] 1.00000000 0.05952250 0.10294743 0.02821013 0.56733619 0.02124663 0.16333396 0.15993701 0.04404072 0.09964003
 [2,] 0.05952250 1.00000000 0.69874685 0.95154257 0.02811935 0.07795172 0.03075815 0.14285181 0.11430204 0.77066081
 [3,] 0.10294743 0.69874685 1.00000000 0.54515880 0.02233732 0.01342225 0.01089910 0.04969148 0.02491906 0.37441528
 [4,] 0.02821013 0.95154257 0.54515880 1.00000000 0.01474840 0.07919354 0.02085142 0.11319900 0.10818152 0.71780346
 [5,] 0.56733619 0.02811935 0.02233732 0.01474840 1.00000000 0.10624022 0.63154230 0.42703027 0.17264028 0.09589276
 [6,] 0.02124663 0.07795172 0.01342225 0.07919354 0.10624022 1.00000000 0.43695249 0.68808117 0.96182939 0.29834532
 [7,] 0.16333396 0.03075815 0.01089910 0.02085142 0.63154230 0.43695249 1.00000000 0.77131521 0.54379540 0.14405892
 [8,] 0.15993701 0.14285181 0.04969148 0.11319900 0.42703027 0.68808117 0.77131521 1.00000000 0.83797628 0.44388376
 [9,] 0.04404072 0.11430204 0.02491906 0.10818152 0.17264028 0.96182939 0.54379540 0.83797628 1.00000000 0.39394556
[10,] 0.09964003 0.77066081 0.37441528 0.71780346 0.09589276 0.29834532 0.14405892 0.44388376 0.39394556 1.00000000
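As a sanity check, any entry of \(K\) can be reproduced from the corresponding entry of the distance matrix; for the first two points above (1187.7291 km apart), in Python:

```python
import math

# Reproduce K[1,2] from the printed distance between points 1 and 2
alpha, rho = 1.0, 500.0   # parameters chosen above
d12 = 1187.7291           # distance in km between the first two points
k12 = alpha**2 * math.exp(-0.5 * (d12 / rho)**2)
print(round(k12, 5))  # 0.05952, matching K[1,2] above
```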

Let’s visualize the shape of the function relating distance to the covariance \(K_{ij}\):

Code
curve(alpha^2 * exp(-0.5 * ((x / rho) ^ 2)), 
      from=0, to=2150, 
      lty=2, 
      xlab="Distance (in km)" , 
      ylab="Covariance" )

We simulate the latent Gaussian Process function as random draws from a zero-centered multivariate normal distribution with \(K\) as the covariance matrix. Let’s visualize the spatial autocorrelation:

Code
set.seed(4924)
mu <- MASS::mvrnorm(1, mu = rep(0, N), Sigma = K)

ggplot(data.frame(GP = mu,
                  Longitude = lon,
                  Latitude = lat),
       aes(x = Longitude, y = Latitude, fill = GP)) +
  geom_point(size = 5, shape = 21, col = 'white') +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

2.5.1.1 M5_1

We then simulate the observed counts of dead trees.

Code
set.seed(492)
nb_tot <- rep(30,N) # nb of trees per plot
beta_0 <- -6 # intercept
eta <- beta_0 + mu
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N, nb_tot, p) # nb of dead trees in each plot


ggplot(data.frame(Y = nb_dead,
                  Longitude = lon,
                  Latitude = lat),
       aes(x = Longitude, y = Latitude, fill = Y)) +
  geom_point(size = 5, shape = 21, col = 'white') +
  scale_fill_continuous(type = "viridis", name = "nb dead trees") +
  theme_minimal()
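Note that the R expression 1-exp(-exp(eta)) used to simulate p is the inverse complementary log-log link, the same transformation as Stan's inv_cloglog. A quick numerical check in Python (the helper names are ours, for illustration):

```python
import math

def inv_cloglog(eta):
    # inverse complementary log-log link: maps the linear predictor to (0, 1)
    return 1 - math.exp(-math.exp(eta))

def cloglog(p):
    # complementary log-log link: log(-log(1 - p))
    return math.log(-math.log(1 - p))

p = inv_cloglog(-6)          # mortality probability at the simulated intercept beta_0 = -6
print(round(p, 6))           # 0.002476
print(round(cloglog(p), 6))  # -6.0, round trip back to the linear predictor
```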

Code
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m5_1_noncentered.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
functions{
  matrix cov_GPL2(matrix x, real alpha, real rho, real delta) {
    real sq_alpha = square(alpha);
    real sq_rho = square(rho);
    int N = dims(x)[1];
    matrix[N, N] K;
    K = sq_alpha * exp(-0.5 * (square(x) / sq_rho)) + diag_matrix(rep_vector(delta, N));
    return cholesky_decompose(K);
  }
}
data {
  int N;
  int nb_dead[N];                                                                 // Number of dead trees in the plot
  int nb_tot[N];                                                                  // Total number of trees in the plot
  matrix[N, N] Dmat;                                                              // Distance matrix (in km)
}
transformed data {
  real delta = 1e-9; // Small offset to ensure the covariance matrix is positive definite
}
parameters {
  vector[N] eta;
  real beta_0;
  real<lower=0> alpha;
  real<lower=0> rho;
}
transformed parameters {
  matrix[N, N] L_K;
  vector[N] mu;
  L_K = cov_GPL2(Dmat, alpha, rho, delta);
  mu = L_K * eta;

}
model {
  rho ~ normal(500, 200);
  alpha ~ normal(0,1);
  beta_0 ~ normal(-4,2);
  eta ~ std_normal(); // Multiplier for non-centred GP parameterisation

  nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + mu));
} 
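The non-centred parameterisation in this model relies on the fact that if \(z \sim \mathcal{N}(0, I)\) and \(L_K L_K^\top = K\), then \(L_K z \sim \mathcal{N}(0, K)\). A minimal check of the Cholesky property on a hand-built 2×2 kernel (Python, illustrative only):

```python
import math

# 2x2 squared-exponential covariance between two points 500 km apart (alpha = 1, rho = 500)
k = math.exp(-0.5 * (500.0 / 500.0)**2)  # off-diagonal covariance ~0.6065
K = [[1.0, k], [k, 1.0]]

# Cholesky factor of a 2x2 matrix with unit diagonal, computed by hand
L = [[1.0, 0.0], [k, math.sqrt(1.0 - k**2)]]

# L @ L^T reproduces K, which is why L_K * eta (eta standard normal) has covariance K
LLt = [[sum(L[i][m] * L[j][m] for m in range(2)) for j in range(2)] for i in range(2)]
print(all(abs(LLt[i][j] - K[i][j]) < 1e-12 for i in range(2) for j in range(2)))  # True
```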
Code
# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 Dmat=m)

# Running stan model
mod <- sampling(stancode, data = datalist, 
                iter = 2000, 
                chains = 4, cores = 4,
                init=0, 
                save_warmup = FALSE,
                pars=c("mu","beta_0","alpha","rho")) 

mod %>% saveRDS(file=here("outputs/ValidationNFI/Simulations/m5_1.rds"))
Code
mod <- readRDS(file=here("outputs/ValidationNFI/Simulations/m5_1.rds"))

# Model coefficients
broom.mixed::tidyMCMC(mod,
                      pars=c("beta_0","alpha","rho"),
                      droppars = NULL, 
                      conf.level=0.95,
                      estimate.method = "median", 
                      ess = T, rhat = T, conf.int = T) %>% 
  kable_mydf(round_number=3)
term estimate std.error conf.low conf.high rhat ess
beta_0 -5.617 0.593 -7.059 -4.687 1.000 1537
alpha 0.923 0.458 0.145 1.979 1.003 1284
rho 429.562 162.166 165.230 797.061 1.005 1503
Code
make_coeff_table(mod, pars=c("beta_0","alpha","rho"), true_params = c(beta_0,alpha,rho)) %>% 
  kable_mydf()
term mean mean_absolute_deviation median std_deviation conf80_low conf80_high conf95_low conf95_high rhat ess true_values conf80_coverage conf95_coverage
beta_0 -5.62 0.53 -5.69 0.59 -6.47 -5.04 -7.06 -4.69 1 1537 -6 TRUE TRUE
alpha 0.92 0.43 0.95 0.46 0.39 1.56 0.14 1.98 1 1284 1 TRUE TRUE
rho 429.56 159.09 444.00 162.17 243.65 657.37 165.23 797.06 1 1503 500 TRUE TRUE

Let’s look at the pairwise scatter plots for \(\rho\), \(\alpha\) and \(\beta_0\).

Code
mcmc_pairs(as.array(mod), 
           np = nuts_params(mod), 
           pars = c("beta_0","alpha","rho"),
           off_diag_args = list(size = 0.75))

Using the code from Nicholas Clark's Stan Geostatistical Model tutorial, we visualize the posterior estimates of the spatial covariance kernel to see whether the model captures the simulated covariance well.

Code
# We define a function to compute the covariance kernel for each posterior draw
quad_kernel <- function(rho, alpha, distances){alpha^2 * exp(-0.5 * ((distances / rho)^2))}


# We specify the distances (in km) over which to compute the posterior kernel estimates
distances = seq(0, 2150, length.out = 75)

# We compute the posterior kernels
kernels <- matrix(NA, nrow = 1000, ncol = length(distances))
list_posteriors <- rstan::extract(mod, pars=c("rho","alpha"))
for(i in 1:1000){
  kernels[i,] <- quad_kernel(rho = list_posteriors$rho[i],
                             alpha = list_posteriors$alpha[i],
                             distances = distances)
}

# Calculate posterior empirical quantiles for the kernel
probs <- c(0.05, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95)
cred <- sapply(1:NCOL(kernels),
               function(n) quantile(kernels[,n],
                                    probs = probs))

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

# Plot the estimated kernels and overlay the true simulated kernel in black
pred_vals <- distances
plot(1, xlim = c(0, 2150), ylim = c(0, max(cred)), type = 'n',
     xlab = 'Distance (in km)', ylab = 'Covariance',
     bty = 'L')
box(bty = 'L', lwd = 2)
polygon(c(pred_vals, rev(pred_vals)), c(cred[1,], rev(cred[9,])),
        col = c_light, border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[2,], rev(cred[8,])),
        col = c_light_highlight, border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[3,], rev(cred[7,])),
        col = c_mid, border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[4,], rev(cred[6,])),
        col = c_mid_highlight, border = NA)
lines(pred_vals, cred[5,], col = c_dark, lwd = 2.5)

# Overlay the true simulated kernel
true_kernel <- quad_kernel(rho = rho,
                           alpha = alpha,
                           distances = distances)
lines(pred_vals,
      true_kernel, lwd = 3.5, col = 'white')
lines(pred_vals,
      true_kernel, lwd = 3, col = 'black')

2.5.1.2 M5_2

We then simulate the observed counts of dead trees.

2.6 M3 + spatial autocorrelation

As I was not able to fit model M5_1 on a large dataset, I checked whether model M3 could recover the true parameter values from simulated data that include spatial autocorrelation among plots.

Here is a sample of the covariance functions that can arise in the simulations, given the chosen distributions for \(\rho\) and \(\alpha\):

\[\rho \sim \text{Uniform}(50,600)\] \[\alpha \sim \text{Uniform}(0.2,1.5)\]

Code
plot(NULL, xlab="Distance (in km)" , ylab= "Covariance", xlim=c(0,2150), ylim=c(0,3))

n_curve <- 100
rho <- runif(n_curve, 50, 600)
alpha <- runif(n_curve, 0.2, 1.5)

for ( i in 1:n_curve ) curve(alpha[i]^2*exp(-0.5 * ((x / rho[i])^2))  , add=T, col = col.alpha("darkgreen",0.5))

Code
# Stan code of model M3
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m3.stan"))

nb_setseed <- sample(1:1000, nb_simulations, replace=FALSE)

# Run the simulations
lapply(nb_setseed,function(seed){
  
set.seed(seed)
  

N <- 5000 # number of plots
data <- sample_n(data,N,replace=F) # sample N plots among the true NFI plots
nb_tot <- data$nb_tot  # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates
nb_country <- length(unique(data$country))
country <- as.numeric(data$country)

# Distance matrix among plots (in km)
m <- data %>% 
  dplyr::select(latitude,longitude) %>% 
  geodist(measure="geodesic")

m <- m / 1000 # convert distances to kilometers

# Covariance matrix
rho <- runif(1, 50, 600)
alpha <- runif(1, 0.2, 1.5)
K <- alpha^2 * exp(-0.5 * ((m / rho) ^ 2)) + diag(1e-9, N)

mu <- MASS::mvrnorm(1, mu = rep(0, N), Sigma = K)

ggplot(data.frame(GP = mu,
                  Longitude = data$longitude,
                  Latitude = data$latitude),
       aes(x = Longitude, y = Latitude, fill = GP)) +
  geom_point(size = 2, shape = 21, col = 'white') +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts
beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area

eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) + mu # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot


ggplot(data.frame(Y = nb_dead,
                  Longitude = data$longitude,
                  Latitude = data$latitude),
       aes(x = Longitude, y = Latitude, fill = Y)) +
  geom_point(size = 3, shape = 21, col = 'white') +
  scale_fill_continuous(type = "viridis", name = "nb dead trees") +
  theme_minimal()

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, 
                data = datalist, 
                iter = 2000, chains = 4, cores = 4,init=0, 
                save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                pars=c("alpha_country","beta_GO","beta_C"),
                true_params = c(alpha_country,beta_GO,beta_C))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here("outputs/ValidationNFI/Simulations/sim_m3_withspatialautocorrelation.rds"))

The following table shows the coverage of the 80% and 95% credible intervals for the country-specific intercepts and the \(\beta_C\) and \(\beta_{GO}\) coefficients when model M3 is fitted to data simulated with spatial autocorrelation; coverage falls well below the nominal levels, especially for the intercepts:

Code
readRDS(file=here("outputs/ValidationNFI/Simulations/sim_m3_withspatialautocorrelation.rds")) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 7 13
alpha_country[2] 26 37
beta_C 63 76
beta_GO 67 88
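A note on the offset used in all these simulations: combining the cloglog link with the log(nb_years) offset amounts to assuming a constant annual mortality hazard, since \(1-\exp(-e^{\eta}\,t) = 1-(1-p_1)^t\), where \(p_1\) is the one-year mortality probability. A quick numerical check (Python, for illustration):

```python
import math

eta, t = -6.0, 10   # hypothetical linear predictor and a 10-year census interval

# Mortality probability with the log(t) offset inside the cloglog link
p_t = 1 - math.exp(-math.exp(eta + math.log(t)))

# Equivalent form: one-year probability p_1 compounded over t years
p_1 = 1 - math.exp(-math.exp(eta))
print(abs(p_t - (1 - (1 - p_1)**t)) < 1e-12)  # True: the two forms coincide
```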

2.7 M6: adding a third predictor

2.7.1 M6_1

Code
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m6_1.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
  int N;
  vector[N] log_nb_years;                                                           // Offset to account for different census intervals
  vector[N] C;                                                                      // Proxy of the competition among trees (i.e. basal area)
  vector[N] DBH;                                                                      // Proxy of the average tree age (i.e. mean DBH)
  vector[N] GO;                                                                     // Genomic offset
  int nb_dead[N];                                                                   // Number of dead trees in the plot
  int nb_tot[N];                                                                    // Total number of trees in the plot
  int<lower=0> nb_country;                                                          // Number of countries
  int<lower=0, upper=nb_country> country[N];                                        // Countries
}
parameters {
  vector[nb_country] alpha_country;
  real beta_GO;
  real beta_C;
  real beta_DBH;
}
model {
  nb_dead ~ binomial(nb_tot,inv_cloglog(alpha_country[country] + beta_GO * GO + beta_C * C  + beta_DBH * DBH + log_nb_years));

  alpha_country ~ normal(0, 1);
  beta_GO ~ normal(0, 1);
  beta_C ~ normal(0, 1);
  beta_DBH ~ normal(0, 1);
} 
Code
# Parameters to estimate
params_to_estimate <- c("alpha_country","beta_GO","beta_C", "beta_DBH")

2.7.1.1 Balanced design

Code
set.seed(4922424)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
nb_country <- 2
country <- c(rep(1,N/2),rep(2,N/2))

# Data simulation
lapply(nb_setseed,function(seed){
  
set.seed(seed)
  
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 DBH=DBH,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))


# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C,beta_DBH))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_balanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_balanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 75 95
alpha_country[2] 77 93
beta_C 77 91
beta_DBH 77 96
beta_GO 85 95

2.7.1.2 Unbalanced design

Code
set.seed(53)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_country <- 2 # nb of countries
country <- c(rep(1,N/2),rep(2,N/2))


# Data simulation
lapply(nb_setseed,function(seed){

set.seed(seed) 

nb_tot <- sample(5:30, N, replace=T) # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)

alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH

GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 DBH=DBH,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C,beta_DBH))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_unbalanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_unbalanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 66 87
alpha_country[2] 80 92
beta_C 81 94
beta_DBH 73 90
beta_GO 75 93

2.7.1.3 True design

Code
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Data simulation
lapply(nb_setseed, function(seed){
  
set.seed(seed)
  
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot  # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates

nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts

beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH

GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH

eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 DBH=DBH,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C,beta_DBH))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_truedesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_truedesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 84 95
alpha_country[2] 78 93
beta_C 78 98
beta_DBH 78 97
beta_GO 81 97

2.7.2 M6_2

Code
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m6_2.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
  int N;
  vector[N] log_nb_years;                                                           // Offset to account for different census intervals
  vector[N] C;                                                                      // Proxy of the competition among trees (i.e. basal area)
  vector[N] DBH;                                                                    // Proxy of the average tree age (i.e. mean DBH)
  vector[N] GO;                                                                     // Genomic offset
  int nb_dead[N];                                                                   // Number of dead trees in the plot
  int nb_tot[N];                                                                    // Total number of trees in the plot
  int<lower=0> nb_country;                                                          // Number of countries
  int<lower=0, upper=nb_country> country[N];                                        // Countries
}
parameters {
  vector[nb_country] alpha_country;
  real beta_GO;
  real beta_C;
  real beta_DBH;
  real beta_C_DBH;
}
model {
  nb_dead ~ binomial(nb_tot,inv_cloglog(alpha_country[country] + beta_GO * GO + beta_C * C  + beta_DBH * DBH + beta_C_DBH * C .* DBH + log_nb_years));

  alpha_country ~ normal(0, 1);
  beta_GO ~ normal(0, 1);
  beta_C ~ normal(0, 1);
  beta_DBH ~ normal(0, 1);
  beta_C_DBH ~ normal(0, 1);
} 
Code
# Parameters to estimate
params_to_estimate <- c("alpha_country","beta_GO","beta_C", "beta_DBH","beta_C_DBH")

2.7.2.1 Balanced design

Code
set.seed(49294)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
nb_country <- 2
country <- c(rep(1,N/2),rep(2,N/2))

# Data simulation
lapply(nb_setseed,function(seed){
  
set.seed(seed)
  
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
beta_C_DBH <- runif(1,-1.5,1.5) # coeff of the interaction between mean DBH and basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 DBH=DBH,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))


# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C,beta_DBH,beta_C_DBH))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_balanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_balanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
parameter conf80_coverage conf95_coverage
alpha_country[1] 75 90
alpha_country[2] 77 90
beta_C 68 88
beta_C_DBH 76 94
beta_DBH 75 89
beta_GO 77 92

2.7.2.2 Unbalanced design

Code
set.seed(53)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Experimental design
N <- 1000 # nb of plots
nb_country <- 2 # nb of countries
country <- c(rep(1,N/2),rep(2,N/2))


# Data simulation
lapply(nb_setseed,function(seed){

set.seed(seed) 

nb_tot <- sample(5:30, N, replace=T) # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)

alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
beta_C_DBH <- runif(1,-1.5,1.5) # coeff of the interaction between mean DBH and basal area

GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 DBH=DBH,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C,beta_DBH,beta_C_DBH))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_unbalanceddesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_unbalanceddesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 68 | 87 |
| alpha_country[2] | 67 | 93 |
| beta_C | 77 | 95 |
| beta_C_DBH | 74 | 94 |
| beta_DBH | 69 | 91 |
| beta_GO | 72 | 91 |

2.7.2.3 True design

Code
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)

# Data simulation
lapply(nb_setseed, function(seed){
  
set.seed(seed)
  
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot  # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates

nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts

beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
beta_C_DBH <- runif(1,-1.5,1.5) # coeff of the interaction between mean DBH and basal area

GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH

eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot

# Stan list
datalist <- list(N=N,
                 nb_dead=nb_dead,
                 nb_tot=nb_tot,
                 GO=GO,
                 C=C,
                 DBH=DBH,
                 nb_country=nb_country,
                 country=country,
                 log_nb_years=log(nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Build coeff table
make_coeff_table(mod,
                 pars = params_to_estimate,
                 true_params = c(alpha_country,beta_GO,beta_C,beta_DBH,beta_C_DBH))

}) %>% 
  bind_rows(.id = "nb_sim") %>% 
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_truedesign.rds")))
Code
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_truedesign.rds"))) %>% 
  calc_coverage_across_sims %>% 
  kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 83 | 95 |
| alpha_country[2] | 72 | 91 |
| beta_C | 79 | 95 |
| beta_C_DBH | 80 | 99 |
| beta_DBH | 78 | 94 |
| beta_GO | 87 | 98 |

2.7.2.4 True data with random GO

In this set of simulations, we use the NFI data but replace the genomic offset predictions at the location of the NFI plots with a randomly generated variable \(x \sim \mathcal{N}(0,1)\).

Code
run_randomGO <- function(stancode,data){

random_GO <- rnorm(nrow(data),mean=0,sd=1) 

datalist <- list(N=nrow(data),
                 nb_dead=data$nb_dead,
                 nb_tot=data$nb_tot,
                 GO=(random_GO - mean(random_GO)) / sd(random_GO),
                 C=(data$basal_area-mean(data$basal_area))/sd(data$basal_area),
                 DBH=(data$mean_DBH - mean(data$mean_DBH))/sd(data$mean_DBH),
                 nb_country=length(unique(data$country)),
                 country=as.numeric(data$country),
                 log_nb_years=log(data$nb_years))

# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE) 

# Model coefficients
conf95 <- broom.mixed::tidyMCMC(mod,
                                pars=params_to_estimate,
                                droppars = NULL, 
                                robust = FALSE, # give mean and standard deviation
                                ess = T, # effective sample size estimates
                                rhat = T, # Rhat estimates
                                conf.int = T, conf.level = 0.95 # include 95% credible intervals
                                ) %>% 
    dplyr::rename(conf95_low=conf.low,
                  conf95_high=conf.high,
                  mean=estimate,
                  std_deviation=std.error)
  
  
  broom.mixed::tidyMCMC(mod,
                        pars=params_to_estimate,
                        droppars = NULL, 
                        robust = TRUE, # give median and median absolute deviation (MAD)
                        ess = F, rhat = F, 
                        conf.int = T, conf.level = 0.80 # include 80% credible intervals
                        ) %>% 
    dplyr::rename(conf80_low=conf.low,
                  conf80_high=conf.high,
                  median=estimate,
                  median_absolute_deviation=std.error) %>% 
     inner_join(conf95,by=c("term")) %>% 
     dplyr::select(term,
                   mean, std_deviation,
                   median, median_absolute_deviation,
                   conf80_low,conf80_high,
                   conf95_low,conf95_high, 
                   rhat,ess)
   

}

set.seed(4935)

lapply(1:nb_simulations, function(x) run_randomGO(stancode,data)) %>% 
  saveRDS(here(paste0("outputs/ValidationNFI/Simulations/m6_2_truedata_randomGO_",nb_simulations,"simulations.rds")))
Code
# Function to count the number of simulations in which the 95% and 80% credible intervals overlap with zero. 
count_overlapp_with_zero_across_sims <- function(x){
  x %>% 
    group_by(term) %>% 
    group_split() %>% 
    purrr::map(\(x){
      data.frame(parameter=unique(x$term),
                 conf80_include_zero=sum(x$conf80_include_zero),
                 conf95_include_zero=sum(x$conf95_include_zero))
    }) %>% 
    bind_rows()
}

# Extract the outputs and identify the simulations in which the 95% and 80% credible intervals overlap with zero. 
df <- readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/m6_2_truedata_randomGO_",nb_simulations,"simulations.rds"))) %>% 
  bind_rows(.id="sim_ID") %>% 
  dplyr::filter(term == "beta_GO") %>%
  mutate(conf80_include_zero = if_else(conf80_low<=0 & conf80_high>=0, TRUE,FALSE),
         conf95_include_zero = if_else(conf95_low<=0 & conf95_high>=0, TRUE,FALSE))

# How many times do the 95% and 80% credible intervals include zero?
df %>% 
  count_overlapp_with_zero_across_sims %>% 
  kable_mydf()
| parameter | conf80_include_zero | conf95_include_zero |
|---|---|---|
| beta_GO | 34 | 51 |
Code
# Show the mean and 95% / 80% credible intervals with interval plots
df %>% 
  ggplot(aes(mean, reorder(sim_ID, mean))) +
  geom_vline(xintercept = 0, color="gray30") +
  geom_errorbar(aes(xmin = conf95_low, xmax = conf95_high), color="#009E73") +
  geom_errorbar(aes(xmin = conf80_low, xmax = conf80_high), color="orange") +
  geom_point() +
  theme_bw() +
  labs(x="Effect size of a randomly generated variable", y="ID of the simulations") +
  theme(axis.text.y = element_text(size=6))